Abstract

We present in this paper a novel approach for training deterministic
auto-encoders. We show that by adding a well chosen penalty term to the
classical reconstruction cost function, we can achieve results that equal
or surpass those attained by other regularized auto-encoders as well as
denoising auto-encoders on a range of datasets. This penalty term
corresponds to the Frobenius norm of the Jacobian matrix of the encoder
activations with respect to the input. We show that this penalty term
results in a localized space contraction which in turn yields robust features
on the activation layer. Furthermore, we show how this penalty term
is related to both regularized auto-encoders and denoising encoders and
how it can be seen as a link between deterministic and non-deterministic
auto-encoders. We find empirically that this penalty helps to carve a
representation that better captures the local directions of variation dictated
by the data, corresponding to a lower-dimensional non-linear manifold,
while being more invariant to the vast majority of directions orthogonal
to the manifold. Finally, we show that by using the learned features to
initialize an MLP, we achieve state of the art classification error on a range
of datasets, surpassing other methods of pre-training.

1 Introduction

A recent topic of interest¹ in the machine learning community is the
development of algorithms for unsupervised learning of a useful
representation. This automatic discovery and extraction of features is often
used in building a deep hierarchy of features, within the contexts of
supervised, semi-supervised, or unsupervised modeling. See Bengio (2009) for a
recent review of Deep Learning algorithms. Most of these methods exploit, as a
basic building block, algorithms for learning one level of feature extraction:
the representation learned at one level is used as input for learning the next
level, etc. The objective is that these representations become better as depth
is increased, but what defines a good representation? It is fairly well
understood what PCA or ICA do, but much remains to be done to understand the
characteristics and theoretical advantages of the representations learned by a
Restricted Boltzmann Machine Hinton et al. (2006), an auto-encoder Bengio et
al. (2007), sparse coding Olshausen and Field (1997); Ranzato et al. (2007);
Kavukcuoglu et al. (2009); Zeiler et al. (2010), or semi-supervised embedding
Weston et al. (2008). All of these produce a non-linear representation which,
unlike that of PCA or ICA, can be stacked (composed) to yield deeper levels of
representation. It has also been observed empirically Lee et al. (2009) that
the deeper levels often capture more abstract features (such as parts of
objects) defined in terms of less abstract ones (such as sub-parts of objects
or low-level visual features like edges), and that these features are
generally more invariant Goodfellow et al. (2009) to changes in the known
factors of variation in the data (such as geometric transformations in the
case of images). A simple approach, used here, to empirically verify that the
learned representations are useful, is to use them to initialize a classifier
(such as a multi-layer neural network), and measure classification error. Many
experiments show that deeper models can thus yield lower classification error
(Bengio et al., 2007; Jarrett et al., 2009; Vincent et al., 2008).

1 See the NIPS'2010 Workshop on Deep Learning and Unsupervised Feature Learning.



Contribution. What principles should guide the learning of such interme- 
diate representations? They should capture as much as possible of the informa- 
tion in each given example, when that example is likely under the underlying 
generating distribution. That is what auto-encoders Vincent et al. (2008) and 
sparse coding aim to achieve when minimizing reconstruction error. 

We would also like these representations to be useful in characterizing the 
input distribution, and that is what is achieved by directly optimizing a gen- 
erative model's likelihood (such as RBMs), or a proxy, such as Score Match- 
ing Hyvarinen (2005). In this paper, we introduce a penalty term that could 
be added to either of the above contexts, which encourages the intermediate 
representation to be robust to small changes of the input around the train- 
ing examples. We show through comparative experiments on many benchmark 
datasets that this characteristic is useful to learn representations that help train- 
ing better classifiers. Previous work has shown that deep learners can discover 
representations whose features are invariant to some of the factors of variation 
in the input (Goodfellow et al., 2009). It would be nice to move further in
this direction, towards representation learning algorithms which help to
disentangle the factors of variation that underlie the data generating
distribution.
We hypothesize that whereas the proposed penalty term encourages the learned 
features to be locally invariant without any preference for particular directions, 
when it is combined with a reconstruction error or likelihood criterion we obtain 
invariance in the directions that make sense in the context of the given training 
data, i.e., the variations that are present in the data should also be captured 
in the learned representation, but the other directions may be contracted in the 
learned representation. 



2 How to extract robust features 

Most successful modern approaches for building deep networks begin by ini- 
tializing each layer in turn, using a local unsupervised learning technique, to 
extract potentially useful features for the next layer. When used as feature
extractors in this fashion, both RBMs and various flavors of auto-encoders lead
to a non-linear feature extractor of the exact same form: a linear mapping
followed by a sigmoid non-linearity². From this perspective, these algorithms
are but different unsupervised techniques to learn the parameters of a mapping
of that form. It is not yet fully understood what properties of such a mapping
contribute to superior classification performance (for classifiers initialized
with the produced features). It has been argued that mappings that produce a
sparse representation are to be encouraged, which inspired several variants of
sparse auto-encoders.

The research we present here is motivated by a different property: our work- 
ing hypothesis is that a good representation of a likely input (under the unknown 
data distribution) should be expected to remain rather stable (i.e. be robust, 
invariant, insensitive) under tiny perturbations of that input. This prompts us 
to propose an alternative regularization term for auto-encoders. 

To encourage robustness of the representation $f(x)$ obtained for a training
input $x$ we propose to penalize its sensitivity to that input, measured as the
Frobenius norm of the Jacobian $J_f(x)$ of the non-linear mapping. Formally,
if input $x \in \mathbb{R}^{d_x}$ is mapped by encoding function $f$ to hidden
representation $h \in \mathbb{R}^{d_h}$, this sensitivity penalization term is
the sum of squares of all partial derivatives of the extracted features with
respect to input dimensions:

$$\|J_f(x)\|_F^2 = \sum_{ij} \left( \frac{\partial h_j(x)}{\partial x_i} \right)^2 \qquad (1)$$

Penalizing $\|J_f\|_F^2$ encourages the mapping to the feature space to be
contractive in the neighborhood of the training data. This geometric
perspective, which gives its name to our algorithm, will be further elaborated
on in section 5.3, based on experimental evidence. The flatness induced by
having low valued first derivatives will imply an invariance or robustness of
the representation for small variations of the input. Thus in this study, terms
like invariance, (in-)sensitivity, robustness, flatness and contraction all
point to the same notion.

While such a Jacobian term alone would encourage mapping to a useless
constant representation, it is counterbalanced in auto-encoder training³ by the
need for the learnt representation to allow a good reconstruction of the input⁴.
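As an illustration of the penalty, the following minimal sketch assumes the sigmoid encoder form described in this section (a linear mapping followed by a sigmoid). In that case each partial derivative is $h_j(1-h_j)W_{ji}$, so the squared Frobenius norm has a cheap closed form. The function names and the toy sizes are hypothetical, and the finite-difference Jacobian is only a sanity check, not part of the method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, W, b_h):
    """Sigmoid encoder h = s(W x + b_h)."""
    return sigmoid(W @ x + b_h)

def jacobian_frobenius_sq(x, W, b_h):
    """Closed form of ||J_f(x)||_F^2 for a sigmoid encoder.

    Since dh_j/dx_i = h_j (1 - h_j) W_ji, the sum of squares factorizes
    into a per-unit term times the squared norm of each weight row.
    """
    h = encode(x, W, b_h)
    return float(np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1)))

def jacobian_numeric(x, W, b_h, eps=1e-6):
    """Central finite-difference Jacobian, used only as a sanity check."""
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (encode(x + dx, W, b_h) - encode(x - dx, W, b_h)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(5, 3))   # d_h = 5, d_x = 3 (toy sizes)
b_h = rng.normal(size=5)
x = rng.normal(size=3)

closed_form = jacobian_frobenius_sq(x, W, b_h)
numeric = float(np.sum(jacobian_numeric(x, W, b_h) ** 2))
```

The closed form costs about as much as one forward pass, which is why the explicit Jacobian never needs to be built during training in this setting.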



3 Auto-encoder variants



In its simplest form, an auto-encoder (AE) is composed of two parts, an encoder
and a decoder. It was introduced in the late 80's Rumelhart et al. (1986); Baldi
and Hornik (1989) as a technique for dimensionality reduction, where the output
of the encoder represents the reduced representation and where the decoder is
tuned to reconstruct the initial input from the encoder's representation through
the minimization of a cost function. More specifically, when the encoding
activation functions are linear and the number of hidden units is smaller than
the input dimension (hence forming a bottleneck), it has been shown that the
learnt parameters of the encoder span a subspace of the principal components of
the input space Baldi and Hornik (1989). With the use of non-linear activation
functions an AE can however be expected to learn more useful feature-detectors
than what can be obtained with a simple PCA (Japkowicz et al., 2000). Moreover,
contrary to their classical use as dimensionality-reduction techniques, in
their modern instantiation auto-encoders are often employed in a so-called
over-complete setting to extract a number of features larger than the input
dimension, yielding a rich higher-dimensional representation. In this setup,
using some form of regularization becomes essential to avoid uninteresting
solutions where the auto-encoder could perfectly reconstruct the input without
needing to extract any useful feature. This section formally defines the
auto-encoder variants considered in this study.

2 This corresponds to the encoder part of the traditional auto-encoder
neural-network and its regularized variants. In RBMs, the conditional
expectation of the hidden layer given the visible layer has the exact same form.

3 Using also the now common additional constraint of encoder and decoder
sharing the same (transposed) weights, which precludes a mere global
contracting scaling in the encoder and expansion in the decoder.

4 A likelihood-related criterion would also similarly prevent a collapse of the
representation.

Basic auto-encoder (AE). The encoder is a function $f$ that maps an input
$x \in \mathbb{R}^{d_x}$ to hidden representation $h(x) \in \mathbb{R}^{d_h}$.
It has the form

$$h = f(x) = s_f(Wx + b_h), \qquad (2)$$

where $s_f$ is a nonlinear activation function, typically a logistic
$\mathrm{sigmoid}(z) = \frac{1}{1+e^{-z}}$. The encoder is parametrized by a
$d_h \times d_x$ weight matrix $W$, and a bias vector $b_h \in \mathbb{R}^{d_h}$.

The decoder function $g$ maps hidden representation $h$ back to a
reconstruction $y$:

$$y = g(h) = s_g(W'h + b_y), \qquad (3)$$

where $s_g$ is the decoder's activation function, typically either the identity
(yielding linear reconstruction) or a sigmoid. The decoder's parameters are a
bias vector $b_y \in \mathbb{R}^{d_x}$, and matrix $W'$. In this paper we only
explore the tied weights case, in which $W' = W^T$.

Auto-encoder training consists in finding parameters
$\theta = \{W, b_h, b_y\}$ that minimize the reconstruction error on a training
set of examples $D_n$, which corresponds to minimizing the following objective
function:

$$\mathcal{J}_{AE}(\theta) = \sum_{x \in D_n} L(x, g(f(x))), \qquad (4)$$

where $L$ is the reconstruction error. Typical choices include the squared
error $L(x, y) = \|x - y\|^2$ used in cases of linear reconstruction and the
cross-entropy loss when $s_g$ is the sigmoid (and inputs are in $[0, 1]$):
$L(x, y) = -\sum_{i=1}^{d_x} x_i \log(y_i) + (1 - x_i)\log(1 - y_i)$.
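The definitions in eqs. (2) to (4) can be sketched in a few lines. This is a minimal illustration under stated assumptions: tied weights, a sigmoid for both encoder and decoder, and hypothetical names (`reconstruct`, `squared_error`, `cross_entropy`) and toy sizes that do not come from the paper. Both losses are evaluated on the same sigmoid output purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruct(x, W, b_h, b_y):
    """Tied-weight auto-encoder: y = s_g(W^T s_f(W x + b_h) + b_y)."""
    h = sigmoid(W @ x + b_h)          # eq. (2)
    y = sigmoid(W.T @ h + b_y)        # eq. (3) with W' = W^T
    return h, y

def squared_error(x, y):
    return float(np.sum((x - y) ** 2))

def cross_entropy(x, y, tiny=1e-12):
    # tiny guards against log(0); it is a numerical convenience only
    return float(-np.sum(x * np.log(y + tiny) + (1 - x) * np.log(1 - y + tiny)))

rng = np.random.default_rng(0)
d_x, d_h = 6, 4
W = rng.normal(scale=0.1, size=(d_h, d_x))
b_h, b_y = np.zeros(d_h), np.zeros(d_x)
x = rng.uniform(size=d_x)             # inputs in [0, 1]

h, y = reconstruct(x, W, b_h, b_y)
loss_ce = cross_entropy(x, y)
loss_se = squared_error(x, y)
```

Summing either loss over the training set gives the objective of eq. (4).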

Regularized auto-encoders (AE+wd). The simplest form of regularization is
weight-decay, which favors small weights by optimizing instead the following
regularized objective:

$$\mathcal{J}_{AE+wd}(\theta) = \left( \sum_{x \in D_n} L(x, g(f(x))) \right) + \lambda \sum_{ij} W_{ij}^2, \qquad (5)$$

where the $\lambda$ hyper-parameter controls the strength of the
regularization.

Note that rather than having a prior on what the weights should be, it is
possible to have a prior on what the hidden unit activations should be. From
this viewpoint, several techniques have been developed to encourage sparsity of
the representation (Kavukcuoglu et al., 2008; Lee et al., 2008).

Denoising Auto-encoders (DAE). A successful alternative form of regularization
is obtained through the technique of denoising auto-encoders (DAE) put forward
by Vincent et al. (2008, 2010), where one simply corrupts input $x$ before
sending it through the auto-encoder, which is trained to reconstruct the clean
version (i.e. to denoise). This yields the following objective function:

$$\mathcal{J}_{DAE}(\theta) = \sum_{x \in D_n} \mathbb{E}_{\tilde{x} \sim q(\tilde{x}|x)} \left[ L(x, g(f(\tilde{x}))) \right], \qquad (6)$$

where the expectation is over corrupted versions $\tilde{x}$ of examples $x$
obtained from a corruption process $q(\tilde{x}|x)$. This objective is
optimized by stochastic gradient descent (sampling corrupted examples).

Typically, we consider corruptions such as additive isotropic Gaussian noise:
$\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ and a
binary masking noise, where a fraction $\nu$ of input components (randomly
chosen) have their value set to 0. The degree of the corruption ($\sigma$ or
$\nu$) controls the degree of regularization.
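The two corruption processes can be sketched as follows. This is a minimal illustration, assuming a single flat input vector; the function names and the way the masked fraction is rounded to an integer count are assumptions made here, not details from the paper.

```python
import numpy as np

def corrupt_gaussian(x, sigma, rng):
    """Additive isotropic Gaussian noise with standard deviation sigma."""
    return x + rng.normal(scale=sigma, size=x.shape)

def corrupt_masking(x, nu, rng):
    """Set a randomly chosen fraction nu of the components to 0."""
    x_tilde = x.copy()
    n_masked = int(round(nu * x.size))
    idx = rng.choice(x.size, size=n_masked, replace=False)
    x_tilde[idx] = 0.0
    return x_tilde

rng = np.random.default_rng(0)
x = np.ones(20)
x_gauss = corrupt_gaussian(x, sigma=0.1, rng=rng)
x_mask = corrupt_masking(x, nu=0.25, rng=rng)
```

During stochastic gradient descent a fresh corrupted copy would be drawn each time an example is visited, while the reconstruction target stays the clean `x`.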



4 Contracting auto-encoders (CAE) 

From the motivation of robustness to small perturbations around the training
points, as discussed in section 2, we propose an alternative regularization
that favors mappings that are more strongly contracting at the training samples
(see section 5.3 for a longer discussion). The Contracting auto-encoder (CAE)
is obtained with the regularization term of eq. 1, yielding objective function

$$\mathcal{J}_{CAE}(\theta) = \sum_{x \in D_n} \left( L(x, g(f(x))) + \lambda \|J_f(x)\|_F^2 \right) \qquad (7)$$
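The CAE objective can be evaluated on a batch as in the sketch below. It assumes the setting used in the experiments (tied weights, sigmoid encoder and decoder, cross-entropy reconstruction error) and uses the sigmoid closed form $\|J_f(x)\|_F^2 = \sum_j (h_j(1-h_j))^2 \sum_i W_{ji}^2$. The function name, the returned tuple and the toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cae_objective(X, W, b_h, b_y, lam):
    """Sum over examples of cross-entropy + lam * ||J_f(x)||_F^2 (eq. 7).

    X has one example per row; W has shape (d_h, d_x); weights are tied.
    """
    H = sigmoid(X @ W.T + b_h)                  # (n, d_h)
    Y = sigmoid(H @ W + b_y)                    # (n, d_x)
    tiny = 1e-12                                # avoids log(0)
    recon = -np.sum(X * np.log(Y + tiny) + (1 - X) * np.log(1 - Y + tiny))
    # Sigmoid closed form: sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2, per example
    contraction = np.sum((H * (1 - H)) ** 2 @ np.sum(W ** 2, axis=1))
    return float(recon + lam * contraction), float(recon), float(contraction)

rng = np.random.default_rng(0)
X = rng.uniform(size=(8, 5))
W = rng.normal(scale=0.3, size=(3, 5))
b_h, b_y = np.zeros(3), np.zeros(5)

total, recon, contraction = cae_objective(X, W, b_h, b_y, lam=0.1)
```

Setting `lam=0` recovers the basic auto-encoder objective, which makes the role of the hyper-parameter easy to check.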

Relationship with weight decay. It is easy to see that the squared
Frobenius norm of the Jacobian corresponds to an L2 weight decay in the case
of a linear encoder (i.e. when $s_f$ is the identity function). In this special
case $\mathcal{J}_{CAE}$ and $\mathcal{J}_{AE+wd}$ are identical. Note that in
the linear case, keeping weights small is the only way to have a contraction.
But with a sigmoid non-linearity, contraction and robustness can also be
achieved by driving the hidden units to their saturated regime.
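The linear special case is easy to check numerically. The sketch below, with hypothetical toy sizes, builds the Jacobian of an identity-activation encoder from unit steps and compares its squared Frobenius norm with the weight-decay term $\sum_{ij} W_{ij}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))     # d_h = 4, d_x = 6
b_h = rng.normal(size=4)

def linear_encode(x):
    return W @ x + b_h           # s_f is the identity

# Jacobian of the linear encoder, estimated column by column with unit steps.
# Because the map is affine, these differences are exact up to rounding.
x0 = rng.normal(size=6)
J = np.stack([linear_encode(x0 + e) - linear_encode(x0) for e in np.eye(6)], axis=1)

jacobian_penalty = float(np.sum(J ** 2))
weight_decay = float(np.sum(W ** 2))
```

The Jacobian does not depend on `x0` here, which is why the penalty reduces to a function of the weights alone.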

Relationship with sparse auto-encoders. Auto-encoder variants that 
encourage sparse representations aim at having, for each example, a majority 
of the components of the representation close to zero. For these features to 
be close to zero, they must have been computed in the left saturated part of 
the sigmoid nonlinearity, which is almost flat, with a tiny first derivative. This 
yields a corresponding small entry in the Jacobian Jf(x). Thus, sparse auto- 
encoders that output many close-to-zero features, are likely to correspond to 
a highly contractive mapping, even though contraction or robustness are not 
explicitly encouraged through their learning criterion. 

Relationship with denoising auto-encoders. Robustness to input perturbations
was also one of the motivations of the denoising auto-encoder, as stated in
Vincent et al. (2010). The CAE and the DAE differ however in the following
ways:

• CAEs explicitly encourage robustness of representation $f(x)$, whereas
DAEs encourage robustness of reconstruction $(g \circ f)(x)$ (which may only
partially and indirectly encourage robustness of the representation, as
the invariance requirement is shared between the two parts of the
auto-encoder). We believe that this property makes CAEs a better choice than
DAEs to learn useful feature extractors. Since we will use only the encoder
part for classification, robustness of the extracted features appears more
important than robustness of the reconstruction.

• DAEs' robustness is obtained stochastically (eq. 6) by having several
explicitly corrupted versions of a training point aim for an identical
reconstruction. By contrast, CAEs' robustness to tiny perturbations is obtained
analytically by penalizing the magnitude of first derivatives
$\|J_f(x)\|_F^2$ at training points only (eq. 7).



4.1 Analytical link between denoising auto-encoders and 
contracting auto-encoders 

If the noise used in the DAE is Gaussian, its effect can be approximated
analytically with an added penalty term on a standard auto-encoder cost
function Bishop (1995) and Kingma and LeCun (2010). Let us define
$\mathcal{L}(x, \theta)$ as the loss function of the auto-encoder. We can write
the expected cost of our loss as:

$$C(\theta) = \int \mathcal{L}(x, \theta)\, p(x)\, dx \qquad (8)$$

The empirical cost can be written by expressing our probability distribution
as a series of Dirac functions centered on our samples:

$$C_{clean}(\theta) = \int \mathcal{L}(x, \theta)\, \frac{1}{n} \sum_{i=1}^{n} \delta(x_i - x)\, dx = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}(x_i, \theta) \qquad (9)$$

If we were to express our density $p(x)$ as a series of Gaussian kernels
centered on our samples and with diagonal covariance matrix, our empirical cost
function would become:

$$C_{noisy}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \int \mathcal{L}(x, \theta)\, \mathcal{N}_{x_i, \sigma^2}(x)\, dx \qquad (10)$$

In the context of a de-noising auto-encoder, we approximate this cost by
using samples corrupted with a Gaussian noise. We can express the difference
between these two costs as a function $\phi(\theta)$. Our goal is to find an
approximation for this function.

$$C_{noisy}(\theta) = C_{clean}(\theta) + \phi(\theta) \qquad (11)$$

and thus:

$$\phi(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left[ \int \mathcal{L}(x, \theta)\, \mathcal{N}_{x_i, \sigma^2}(x)\, dx - \mathcal{L}(x_i, \theta) \right] \qquad (12)$$

We can write the term inside the sum by defining the noise term $\epsilon$:

$$D = \int \mathcal{L}(x_i + \epsilon)\, \mathcal{N}_{0, \sigma^2}(\epsilon)\, d\epsilon - \mathcal{L}(x_i) \qquad (13)$$

We can approximate the first term by a Taylor series:

$$\mathcal{L}(x + \epsilon) = \mathcal{L}(x) + \langle J_{\mathcal{L}}(x), \epsilon \rangle + \frac{1}{2} \epsilon^T H_{\mathcal{L}}(x)\, \epsilon + o(\|\epsilon\|^2) \qquad (14)$$

By using this approximation and simplifying our integral, we obtain the
following expression for our function:

$$\phi(\theta) \approx \frac{\sigma^2}{2n} \sum_{i=1}^{n} \mathrm{Tr}(H_{\mathcal{L}}(x_i)) \qquad (15)$$

$$C_{noisy}(\theta) \approx C_{clean}(\theta) + \frac{\sigma^2}{2n} \sum_{i=1}^{n} \mathrm{Tr}(H_{\mathcal{L}}(x_i)) \qquad (16)$$
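The simplification of the integral relies on the first two moments of the noise. As a short worked step, assuming isotropic Gaussian noise with zero mean and covariance $\sigma^2 I$ (the DAE-g setting) and writing $\mathcal{L}$ for the loss, the linear and quadratic terms of the Taylor expansion in eq. 14 integrate to:

```latex
\begin{align*}
\int \langle J_{\mathcal{L}}(x), \epsilon \rangle \,
     \mathcal{N}_{0,\sigma^2}(\epsilon)\, d\epsilon
  &= \langle J_{\mathcal{L}}(x), \mathbb{E}[\epsilon] \rangle = 0, \\
\int \tfrac{1}{2}\, \epsilon^T H_{\mathcal{L}}(x)\, \epsilon \,
     \mathcal{N}_{0,\sigma^2}(\epsilon)\, d\epsilon
  &= \tfrac{1}{2} \operatorname{Tr}\!\left( H_{\mathcal{L}}(x)\,
     \mathbb{E}[\epsilon \epsilon^T] \right)
   = \tfrac{\sigma^2}{2} \operatorname{Tr}\!\left( H_{\mathcal{L}}(x) \right).
\end{align*}
```

The constant term cancels against the clean loss, so only the trace of the Hessian survives at second order.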

The de-noising auto-encoder cost function can thus be approximated by using
a classical auto-encoder with a penalty on the trace of the Hessian of the
cost function with respect to the inputs. When the cost function is the MSE of
the reconstruction, we can write the Hessian as:



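$$H_{\mathcal{L}}(x) = \frac{\partial}{\partial x} \left( \frac{\partial}{\partial x} \left( g(f(x)) - x \right)^2 \right)$$

$$= \frac{\partial}{\partial x} \left( 2 \left( g(f(x)) - x \right) J_{g \circ f}(x) \right)$$

$$= 2 H_{g \circ f}(x) \left( g(f(x)) - x \right) + 2 \langle J_{g \circ f}(x)^T, J_{g \circ f}(x) \rangle \qquad (17)$$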
By taking the trace of the above result, we get:

$$\mathrm{Tr}(H_{\mathcal{L}}(x)) = 2 \left( g(f(x)) - x \right) \mathrm{Tr}(H_{g \circ f}(x)) + 2 \|J_{g \circ f}(x)\|_F^2 \qquad (18)$$

The first term of the equation scales with the reconstruction cost and will
diminish accordingly. The second term of the equation is the squared Frobenius
norm of the Jacobian. Note however that the Jacobian in this case is on the
reconstruction and not on the representation as with our proposed penalty.

Why $J_f(x)$ is a better choice than $J_{g \circ f}(x)$. Adding Gaussian noise
during the training of an auto-encoder is thus asymptotically equivalent to
adding a regularization term to the objective function. In the DAE setting with
MSE cost, the penalty term is the norm of the Jacobian of the reconstruction
units with respect to the input units. This encourages the output to be
invariant to small changes in the inputs. Note that the invariance requirement
is shared between the two parts of the auto-encoder and not explicitly on the
representation. If the goal of the auto-encoder is to initialise the weights of
a Multi-Layer Perceptron (MLP), we do not care about the invariance of the
reconstruction since we only use the representation. We would like the
invariances captured by the auto-encoder to be predominantly on the
representation. By using as a penalty the norm of the Jacobian of the
representation, we are doing this explicitly.

The other drawback of having a regularization over $J_{g \circ f}(x)$ comes
from an optimization point of view. The gradient of the total cost function
with respect to the parameters of the encoder $W$ is a function of $J_g(f)$.
Since $J_{g \circ f}(x) = J_g(f) \cdot J_f(x)$, minimizing $J_{g \circ f}(x)$
by reducing $J_g(f)$ could lead to difficulties in minimizing the
reconstruction cost.



5 Experiments and results 

Considered models. In our experiments, we compare the proposed Contract- 
ing Auto Encoder (CAE) against the following models for unsupervised feature 
extraction: 

• RBM-binary : Restricted Boltzmann Machine trained by Contrastive Di- 
vergence, 

• AE: Basic auto-encoder, 

• AE+wd: Auto-encoder with weight-decay regularization, 

• DAE-g: Denoising auto-encoder with Gaussian noise, 

• DAE-b: Denoising auto-encoder with binary masking noise.

All auto-encoder variants used tied weights, a sigmoid activation function
for both encoder and decoder, and a cross-entropy reconstruction error (see
Section 3). They were trained by optimizing their (regularized) objective
function on the training set by stochastic gradient descent. As for RBMs, they
were trained by Contrastive Divergence.

These algorithms were applied on the training set without using the labels
(unsupervised) to extract a first layer of features. Optionally the procedure
was repeated to stack additional feature-extraction layers on top of the first
one. Once thus trained, the learnt parameter values of the resulting
feature-extractors (weights and biases of the encoder) were used as
initialisation of a multilayer perceptron (MLP) with an extra randomly
initialised output layer. The whole network was then fine-tuned by gradient
descent on a supervised objective appropriate for classification⁵, using the
labels in the training set.

Datasets used. We have tested our approach on a benchmark of image
classification problems, namely:

• CIFAR-10: the image-classification task (32 × 32 × 3 channels RGB)
(Krizhevsky and Hinton, 2009).

• CIFAR-bw: a gray-scale version of the original CIFAR-10. The gray-scale
versions were obtained with a color weighting of 0.3 for red, 0.59 for green
and 0.11 for blue.

• MNIST: the well-known digit classification problem (28 × 28 gray-scale
pixel values scaled to [0,1]). It has 50000 examples for training, 10000 for
validation, and 10000 for test.

Six harder digit recognition problems used in the benchmark of Larochelle
et al. (2007)⁶. They were derived by adding extra factors of variation to MNIST
digits. Each has 10000 examples for training, 2000 for validation, 50000 for
test.

• basic: Smaller subset of MNIST.

• rot: digits with added random rotation.

• bg-rand: digits with random noise background.

• bg-img: digits with random image background.

• bg-img-rot: digits with rotation and image background.

Two artificial shape classification problems from the benchmark of Larochelle
et al. (2007):

• rect: Discriminate between tall and wide rectangles (white on black).

• rect-img: Discriminate between tall and wide rectangular image on a
different background image.

5 We used sigmoid+cross-entropy for binary classification, and log of softmax
for multi-class problems.

6 Datasets available at http://www.iro.umontreal.ca/~lisa/icml2007
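The gray-scale conversion used for CIFAR-bw is a fixed weighted sum of the color channels. A minimal sketch, assuming a channels-last array layout (the layout and the function name are assumptions, only the three weights come from the text):

```python
import numpy as np

GRAY_WEIGHTS = np.array([0.3, 0.59, 0.11])   # red, green, blue weights from the text

def to_grayscale(images):
    """images: array of shape (n, 32, 32, 3), channels last (an assumed layout)."""
    return images @ GRAY_WEIGHTS

rgb = np.zeros((2, 32, 32, 3))
rgb[0, ..., 0] = 1.0     # pure red image
rgb[1] = 1.0             # white image
gray = to_grayscale(rgb)
```

Since the weights sum to one, pixel values that start in [0, 1] stay in [0, 1] after conversion.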



5.1 Classification performance 
5.1.1 MNIST and CIFAR-bw 

Our first series of experiments focuses on the MNIST and CIFAR-bw datasets. 
We compare the classification performance obtained by a neural network with 
one hidden layer of 1000 units, initialized with each of the unsupervised
algorithms under consideration. For each case, we selected the value of
hyperparameters (such as the strength of regularization) that yielded, after
supervised fine-tuning, the best classification performance on the validation
set. Final classification error rate was then computed on the test set. With
the parameters obtained after unsupervised pre-training (before fine-tuning),
we also computed in each case the average value of the encoder's contraction
$\|J_f(x)\|_F$ on the validation set, as well as a measure of the average
fraction of saturated units per example⁷. These results are reported in
Table 1. We see that the local contraction measure (the average $\|J_f\|_F$)
on the pre-trained model strongly correlates with the final classification
error. The CAE, which explicitly tries to minimize this measure while
maintaining a good reconstruction, is the best-performing model on both
datasets.
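The two diagnostics reported alongside test error can be computed from the pre-trained encoder alone. The sketch below assumes a sigmoid encoder, takes the unsquared norm $\|J_f(x)\|_F$ per example before averaging (an assumption about how the average was formed), and uses the 0.05 / 0.95 saturation thresholds. Names and toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contraction_and_saturation(X, W, b_h, low=0.05, high=0.95):
    """Average ||J_f(x)||_F and average fraction of saturated units over rows of X."""
    H = sigmoid(X @ W.T + b_h)
    jac_sq = (H * (1 - H)) ** 2 @ np.sum(W ** 2, axis=1)   # ||J_f(x)||_F^2 per example
    saturated = (H < low) | (H > high)
    return float(np.mean(np.sqrt(jac_sq))), float(np.mean(saturated))

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 10))
b_h = np.zeros(4)
small_W = rng.normal(scale=0.1, size=(4, 10))    # units stay near 0.5
large_W = 100.0 * small_W                         # units pushed into saturation

norm_small, sat_small = contraction_and_saturation(X, small_W, b_h)
norm_large, sat_large = contraction_and_saturation(X, large_W, b_h)
```

Scaling the weights up in this toy setting increases the fraction of saturated units, which is the mechanism by which a sigmoid encoder can contract without shrinking its weights.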

Results given in Table 2 compare the performance of stacked CAEs on the
benchmark problems of Larochelle et al. (2007) to the three-layer models
reported in Vincent et al. (2010). Stacking a second layer CAE on top of a
first layer appears to significantly improve performance, thus demonstrating
their usefulness for building deep networks. Moreover, on the majority of
datasets, the 2-layer CAE beats the state-of-the-art 3-layer model.



5.1.2 CIFAR-10 

The pipeline of preprocessing steps we used here is similar to ?. We randomly
extracted 160000 patches of size 8 × 8 from the first 10000 images of CIFAR-10.
For each patch, we subtract the mean and divide by the standard deviation
(local contrast normalization). Then, a PCA is fitted on this set of patches.
The first 2 components (corresponding to black patches) are dropped and we keep
the next 80 components (out of 192). For building the final training set of
patches, we project these patches on the PCA components, perform whitening,
i.e. divide by the eigenvalues, and pass the result through a logistic function
in order to map it to [0, 1].
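The preprocessing steps can be sketched as below. This is an illustration under stated assumptions rather than the exact pipeline: random data stands in for real CIFAR-10 patches, the function name is hypothetical, and the whitening step divides by the square root of the eigenvalues (the usual unit-variance convention), whereas the text says "divide by the eigenvalues".

```python
import numpy as np

def preprocess_patches(patches, n_drop=2, n_keep=80, tiny=1e-8):
    """patches: (n, 192) array of flattened 8x8x3 patches."""
    # Per-patch normalization: subtract the mean, divide by the standard deviation.
    p = patches - patches.mean(axis=1, keepdims=True)
    p = p / (patches.std(axis=1, keepdims=True) + tiny)
    # PCA via eigendecomposition of the covariance matrix.
    p_centered = p - p.mean(axis=0)
    cov = p_centered.T @ p_centered / p.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending order
    order = np.argsort(eigvals)[::-1]                 # largest variance first
    keep = order[n_drop:n_drop + n_keep]
    projected = p_centered @ eigvecs[:, keep]
    # Whitening: dividing by sqrt(eigenvalue) is an assumption made here;
    # the text describes dividing by the eigenvalues.
    whitened = projected / np.sqrt(eigvals[keep] + tiny)
    return 1.0 / (1.0 + np.exp(-whitened))            # logistic map to (0, 1)

rng = np.random.default_rng(0)
patches = rng.uniform(size=(500, 192))                # stand-in for real patches
features = preprocess_patches(patches)
```

Each patch ends up as an 80-dimensional vector with entries in (0, 1), which is compatible with a cross-entropy reconstruction error.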

A Contracting Auto-Encoder with a number of hidden units
$n_{hid} \in \{50, 100, 200, 400\}$ is trained on this set by minimizing the
cross-entropy reconstruction error and the regularizer with stochastic gradient
descent. We present some filters learned during this process in Figure XX.

7 We consider a unit saturated if its activation is below 0.05 or above 0.95.
Note that in general the set of saturated units is expected to vary with each
example.

Dataset    Model        Test error (%)   Average ||J_f(x)||_F   SAT

MNIST      CAE           1.14             0.73 × 10^-4          86.36%
MNIST      DAE-g         1.18             0.86 × 10^-4          17.77%
MNIST      RBM-binary    1.30             2.50 × 10^-4          78.59%
MNIST      DAE-b         1.57             7.87 × 10^-4          68.19%
MNIST      AE+wd         1.68             5.00 × 10^-4          12.97%
MNIST      AE            1.78             17.5 × 10^-4          49.90%

CIFAR-bw   CAE          47.86             2.40 × 10^-5          85.65%
CIFAR-bw   DAE-b        49.03             4.85 × 10^-6          80.66%
CIFAR-bw   DAE-g        54.81             4.94 × 10^-5          19.90%
CIFAR-bw   AE+wd        55.03             34.9 × 10^-5          23.04%
CIFAR-bw   AE           55.47             44.9 × 10^-5          22.57%

Table 1: Performance comparison of the considered models on MNIST (top 
half) and CIFAR-bw (bottom half). Results are sorted in ascending order of 
classification error on the test set. Best performer and models whose difference 
with the best performer was not statistically significant are in bold. Notice 
how the average Jacobian norm (before fine-tuning) appears correlated with 
the final test error. SAT is the average fraction of saturated units per example. 
Not surprisingly, the CAE yields a higher proportion of saturated units. 



Finally, we evaluate the classification performance of our algorithm with a
linear classifier. The preprocessing steps are applied convolutionally with a
stride equal to one to get $n_{hid}$ feature maps of size 25 × 25. Features are
sum-pooled together over the quadrants of the feature maps. By applying this
coarse dimensionality reduction technique, we obtain features of dimension
$4 n_{hid}$. We fed a linear L2-regularized SVM with these features and
reported the test classification accuracy in TAB XX. The L2 regularizer was
chosen using 5-fold cross validation.
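Sum-pooling over quadrants can be sketched as follows. Because 25 is odd, the quadrants cannot all have the same size; splitting at index 12 is an assumption made for this illustration, as are the function name and the array layout.

```python
import numpy as np

def quadrant_sum_pool(feature_maps):
    """feature_maps: (n_hid, 25, 25) -> vector of length 4 * n_hid.

    With an odd side length the quadrants cannot be equal; splitting at
    index 12 (12 + 13 rows/columns) is an assumption.
    """
    n_hid, height, width = feature_maps.shape
    r, c = height // 2, width // 2
    quadrants = [
        feature_maps[:, :r, :c], feature_maps[:, :r, c:],
        feature_maps[:, r:, :c], feature_maps[:, r:, c:],
    ]
    return np.concatenate([q.sum(axis=(1, 2)) for q in quadrants])

maps = np.ones((50, 25, 25))      # n_hid = 50 constant maps as a toy input
pooled = quadrant_sum_pool(maps)
```

The resulting vector has one entry per feature map and quadrant, which is the input that would be handed to the linear classifier.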



5.2 Closer examination of the contraction 

To better understand the feature extractor produced by each algorithm, in terms 
of their contractive properties, we used the following analytical tools: 

What happens locally: looking at the singular values of the Jacobian. A high
dimensional Jacobian contains directional information: the amount of
contraction is generally not the same in all directions. This can be examined
by performing a singular value decomposition of $J_f$. We computed the average
singular value spectrum of the Jacobian over the validation set for the above
models. Results are shown in Figure 2 and will be discussed in section 5.3.
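The average singular value spectrum can be computed as in the sketch below, assuming a sigmoid encoder so that $J_f(x)$ is the weight matrix with each row scaled by $h_j(1-h_j)$. The function name and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_singular_values(X, W, b_h):
    """Mean singular value spectrum of J_f(x) over the rows of X."""
    spectra = []
    for x in X:
        h = sigmoid(W @ x + b_h)
        J = (h * (1 - h))[:, None] * W          # J_f(x) for a sigmoid encoder
        spectra.append(np.linalg.svd(J, compute_uv=False))
    return np.mean(spectra, axis=0)             # sorted in descending order

rng = np.random.default_rng(0)
X = rng.uniform(size=(20, 6))
W = rng.normal(scale=0.5, size=(8, 6))
b_h = np.zeros(8)
spectrum = average_singular_values(X, W, b_h)
```

A spectrum with a few large values and many small ones indicates that the encoder is sensitive along a small number of input directions and contracts the rest.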



What happens further away: contraction curves. The Frobenius norm of the
Jacobian at some point $x$ measures the contraction of the mapping locally at
that point. Intuitively the contraction induced by the proposed penalty term
can be measured beyond the immediate training examples, by the ratio of the
distance between two points once mapped in the feature space to their distance
in the original (input) space. We call this measure the contraction ratio. In
the limit where the variation in the input space is infinitesimal, this
corresponds to the derivative (i.e. Jacobian) of the representation map.

Data Set     SVM_rbf       SAE-3         RBM-3         DAE-b-3       CAE-1         CAE-2

basic         3.03±0.15     3.46±0.16     3.11±0.15     2.84±0.15     2.83±0.15     2.48±0.14
rot          11.11±0.28    10.30±0.27    10.30±0.27     9.53±0.26    11.59±0.28     9.66±0.26
bg-rand      14.58±0.31    11.28±0.28     6.73±0.22    10.30±0.27    13.57±0.30    10.90±0.27
bg-img       22.61±0.379   23.00±0.37    16.31±0.32    16.68±0.33    16.70±0.33    15.50±0.32
bg-img-rot   55.18±0.44    51.93±0.44    47.39±0.44    43.76±0.43    48.10±0.44    45.23±0.44
rect          2.15±0.13     2.41±0.13     2.60±0.14     1.99±0.12     1.48±0.10     1.21±0.10
rect-img     24.04±0.37    24.05±0.37    22.50±0.37    21.59±0.36    21.86±0.36    21.54±0.36

Table 2: Comparison of stacked contracting auto-encoders with 1 and 2 layers
(CAE-1 and CAE-2) with other 3-layer stacked models and baseline SVM. Test
error rate (%) on all considered classification problems is reported together
with a 95% confidence interval. Best performer is in bold, as well as those for
which confidence intervals overlap. Clearly CAEs can be successfully used to
build top-performing deep networks. 2 layers of CAE often outperformed 3 layers
of other stacked models.

For any encoding function $f$, we can measure the average contraction ratio
for pairs of points, one of which, $x_0$, is picked from the validation set,
and the other, $x_1$, randomly generated on a sphere of radius $r$ centered on
$x_0$ in input space. How this average ratio evolves as a function of $r$
yields a contraction curve. We have computed these curves for the models for
which we reported classification performance (the contraction curves are
however computed with their initial parameters prior to fine-tuning). Results
are shown in Figure 1 for single-layer mappings and in Figure 3 for 2 and 3
layer mappings. They will be discussed in detail in the next section.
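A contraction curve can be estimated as in the sketch below. It assumes a single-layer sigmoid encoder and takes the ratio as feature-space distance over input-space distance, so that small values mean strong contraction. Random data stands in for a validation set, and the function name and radii are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contraction_ratio(X, W, b_h, r, rng):
    """Average of ||f(x0) - f(x1)|| / ||x0 - x1|| with x1 on a sphere of radius r."""
    directions = rng.normal(size=X.shape)
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    X1 = X + r * directions                      # uniformly distributed on the sphere
    H0 = sigmoid(X @ W.T + b_h)
    H1 = sigmoid(X1 @ W.T + b_h)
    return float(np.mean(np.linalg.norm(H0 - H1, axis=1) / r))

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 10))                  # stand-in for validation inputs
W = rng.normal(scale=0.5, size=(6, 10))
b_h = np.zeros(6)
radii = [0.01, 0.1, 1.0, 10.0, 100.0]
curve = [contraction_ratio(X, W, b_h, r, rng) for r in radii]
```

For very large radii the ratio must shrink, because sigmoid outputs are bounded while the input distance keeps growing; the interesting part of the curve is its shape at small and intermediate radii.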

5.3 Discussion: Local Space Contraction 

From a geometrical point of view, the robustness of the features can be seen as
a contraction of the input space when projected in the feature space, in
particular in the neighborhood of the examples from the data-generating
distribution: otherwise (if the contraction was the same at all distances) it
would not be useful, because it would just be a global scaling. This is
happening with the proposed penalty, but rarely so without it, as illustrated
on the contraction curves of Figure 1. For all algorithms tested except the
proposed CAE and the Gaussian corruption DAE (DAE-g), the contraction ratio
decreases (i.e., towards more contraction) as we move away from the training
examples (this is due to more saturation, and was expected), whereas for the
CAE the contraction ratio initially increases, up to the point where the effect
of saturation takes over (the bump occurs at about the maximum distance between
two training examples).

Consider the case where the training examples congregate near a low- 
dimensional manifold. The variations present in the data (e.g. translations and 
rotations of objects in images) correspond to local dimensions along the man- 
ifold, while the variations that are small or rare in the data correspond to the 
directions orthogonal to the manifold (at a particular point near the manifold, 
corresponding to a particular example). The proposed criterion is trying to 
make the features invariant in all directions around the training examples, but 
the reconstruction error (or likelihood) is making sure that the represen- 
tation is faithful, i.e., can be used to reconstruct the input example. Hence the 
directions that resist this contracting pressure (strong invariance to input 
changes) are the directions present in the training set. Indeed, if the variations 
along these directions present in the training set were not preserved, neighboring 
training examples could not be distinguished and properly reconstructed. Hence 
the directions where the contraction is strong (small ratio, small singular values 
of the Jacobian matrix) are also the directions where the model believes that 
the input density drops quickly, whereas the directions where the contraction is 
weak (closer to 1, larger contraction ratio, larger singular values of the Jacobian 
matrix) correspond to the directions where the model believes that the input 
density is flat (and large, since we are near a training example). 
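
The link between the contraction ratio and the singular values can be made explicit for small displacements. The following is a first-order sketch, not a result stated in this form in the text: J denotes the encoder Jacobian at x, \sigma_i and v_i its singular values and right singular vectors, u a unit direction in input space, and d_x the input dimension (these symbols are introduced here for illustration).

```latex
% First-order expansion of the encoder around x, for a unit direction u:
f(x + \epsilon u) - f(x) \approx \epsilon \, J u

% Local contraction ratio in direction u, using the SVD J = U \Sigma V^\top:
\frac{\| f(x + \epsilon u) - f(x) \|}{\epsilon}
  \approx \| J u \|
  = \sqrt{ \sum_i \sigma_i^2 \, (v_i^\top u)^2 }

% Averaged over u uniform on the unit sphere of \mathbb{R}^{d_x}:
\mathbb{E}_u \, \| J u \|^2
  = \frac{1}{d_x} \| J \|_F^2
  = \frac{1}{d_x} \sum_i \sigma_i^2
```

Under this approximation, the ratio along v_i is simply \sigma_i, and the penalized quantity \| J \|_F^2 is proportional to the mean squared contraction ratio over random directions at very small radius.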

We believe that this contraction penalty thus helps the learner carve a kind 
of mountain supported by the training examples, and generalizing to a ridge 
between them. What we would like is for these ridges to correspond to some 
directions of variation present in the data, associated to underlying factors of 
variation. How far do these ridges extend around each training example and 
how flat are they? This can be visualized comparatively with the analysis of 
Figure 1, with the contraction ratio for different distances from the training 
examples. 

Note that different features (elements of the representation vector) would be 
expected to have ridges (i.e. directions of invariance) in different directions, and 
that the "dimensionality" of these ridges (we are in a fairly high-dimensional 
space) gives a hint as to the local dimensionality of the manifold near which the 
data examples congregate. The singular value spectrum of the Jacobian informs 
us about that geometry. The number of large singular values should reveal 
the dimensionality of these ridges, i.e., of that manifold near which examples 
concentrate. This is illustrated in Figure 2, showing the singular value spectrum 
of the encoder's Jacobian. The CAE by far does the best job at representing 
the data variations near a lower-dimensional manifold, and the DAE is second 
best, while ordinary auto-encoders (regularized or not) do not succeed at all in 
this respect. 
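
As an illustration of how such a spectrum can be obtained, the sketch below computes the average singular value spectrum of the Jacobian for an assumed sigmoid encoder h = sigmoid(Wx + b), for which the Jacobian is diag(h(1 - h)) W. The function names and the relative threshold used to count "large" singular values are hypothetical; the threshold in particular is an arbitrary choice made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encoder_jacobian(x, W, b):
    """Jacobian of h = sigmoid(W x + b) w.r.t. x, i.e. diag(h * (1 - h)) @ W."""
    h = sigmoid(W @ x + b)
    return (h * (1.0 - h))[:, None] * W

def average_spectrum(X, W, b):
    """Mean, over the rows of X, of the singular values sorted in decreasing order."""
    spectra = [np.linalg.svd(encoder_jacobian(x, W, b), compute_uv=False)
               for x in X]
    return np.mean(spectra, axis=0)

def frobenius_norm_sq(J):
    """Squared Frobenius norm, equal to the sum of squared singular values."""
    return float(np.sum(J ** 2))

def count_large(spectrum, rel_threshold=0.1):
    """Count singular values above rel_threshold times the largest one.

    rel_threshold is an illustrative assumption, not a value from the paper.
    """
    return int(np.sum(spectrum > rel_threshold * spectrum[0]))
```

A spectrum that drops sharply after a few large values gives a small `count_large`, which is the kind of profile one would read as a low local dimensionality.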

What happens when we stack a CAE on top of another one, to build a deeper 
encoder? This is illustrated in Figure 3, which shows the average contraction 
ratio for different distances around each training point, for depth 1 vs depth 
2 encoders. Composing two CAEs yields even more contraction and even more 
non-linearity, i.e. a sharper profile, with a flatter level of contraction at short 
and medium distances, and a delayed effect of saturation (the bump only comes 
up at farther distances). We would thus expect higher-level features to be more 
invariant in their feature-specific directions of invariance, which is exactly the 
kind of property that motivates the use of deeper architectures. 
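
One way to see why composition can increase local contraction is the chain rule. The following is only a local, first-order sketch under assumed notation: f_1 and f_2 denote the first and second encoders, J_1 and J_2 their Jacobians, and \sigma_{\max} the largest singular value.

```latex
% Jacobian of the stacked encoder f = f_2 \circ f_1:
J_f(x) = J_2\big(f_1(x)\big) \, J_1(x)

% Standard norm inequalities for a matrix product:
\sigma_{\max}(J_f) \le \sigma_{\max}(J_2) \, \sigma_{\max}(J_1),
\qquad
\| J_f \|_F \le \sigma_{\max}(J_2) \, \| J_1 \|_F
```

So wherever the second layer is itself locally contractive (\sigma_{\max}(J_2) < 1), the stacked encoder contracts at least as strongly as the first layer alone. This bound says nothing about behaviour at larger distances, where the empirical curves remain the relevant evidence.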

6 Conclusion 

In this paper, we attempt to answer the following question: what makes a 
good representation? Besides being useful for a particular task, which we can 
measure, or towards which we can train a representation, this paper highlights 
the advantages for representations to be locally invariant in many directions of 
change of the raw input. This idea is implemented by a penalty on the Frobe- 
nius norm of the Jacobian matrix of the encoder mapping, which computes the 
representation. The paper also introduces empirical measures of robustness and 
invariance, based on the contraction ratio of the learned mapping, at different 
distances and in different directions around the training examples. We hypoth- 
esize that this reveals the manifold structure learned by the model, and we find 
(by looking at the singular value spectrum of the mapping) that the Contractive 
Auto-Encoder discovers lower-dimensional manifolds. In addition, experiments 
on many datasets suggest that this penalty always helps an auto-encoder to 
perform better, and competes with or improves upon the representations learned by 
Denoising Auto-Encoders or RBMs, in terms of classification error. 

Figure 1: Contraction curves obtained with the considered models on MNIST 
(top) and CIFAR-bw (bottom). See the main text for a detailed interpretation. 

Figure 2: Average spectrum of the encoder's Jacobian, for the CIFAR-bw 
dataset. Large singular values correspond to the local directions of "allowed" 
variation learnt from the dataset. The CAE has fewer large singular values 
and a more sharply decreasing spectrum, which suggests that it does a better job of 
characterizing a low-dimensional manifold near the training examples. 

Figure 3: Contraction effect as a function of depth. Deeper encoders produce 
features that are more invariant, over a farther distance, corresponding to a flatter 
ridge of the density in the directions of variation captured. 